CLIMATE 405: Machine Learning for Earth and Environmental Sciences; FALL 2024¶

Mohammed Ombadi (ombadi@umich.edu)¶

Lecture 6 (Monday, 09/16/2024)¶

Topics covered in this lecture:¶

  • Clustering: Hierarchical Algorithms
  • Clustering: Density-based Algorithms
  • Examples of Clustering in Earth and Environmental Sciences

Import libraries¶

In [2]:
# Import libraries
import os
import sys
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns
import warnings

from scipy.integrate import odeint
from IPython.display import Image
from statistics import mode
from scipy.stats import pearsonr, spearmanr
from sklearn.feature_selection import mutual_info_regression
import statsmodels.api as sm
import pingouin as pg

# Suppress warnings
warnings.filterwarnings('ignore')

# Set number of decimals for np print options
np.set_printoptions(precision=3)

# Set the current working directory
os.chdir(sys.path[0])

Clustering: Hierarchical Algorithms
¶

Hierarchical algorithms can be either agglomerative (start from individual data points and successively merge them into clusters) or divisive (start from one large cluster and recursively split it into smaller groups).

Agglomerative clustering (visual explanation):¶

In [9]:
display(Image(filename = 'Agglomerative clustering_animation.gif', width= 1000, height= 1000))

The clustering example above can be visualized as a dendrogram:

In [7]:
display(Image(filename = 'Dendrogram_agglomerative clustering.jpeg', width= 1000, height= 1000))
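The dendrogram shown above can be reproduced with SciPy's `linkage` and `dendrogram` functions; here is a minimal sketch on synthetic 2-D points (the data are made up for illustration, not the example from the animation):

```python
import numpy as np
import matplotlib.pyplot as plt
from scipy.cluster.hierarchy import linkage, dendrogram

# Synthetic 2-D points drawn around three centers (illustrative only)
rng = np.random.default_rng(0)
pts = np.vstack([rng.normal(loc=c, scale=0.3, size=(5, 2))
                 for c in [(0, 0), (3, 0), (0, 3)]])

# Build the agglomerative merge tree with Ward linkage;
# each row of Z records one merge: (cluster i, cluster j, distance, new size)
Z = linkage(pts, method='ward')

fig, ax = plt.subplots(figsize=(8, 4))
dendrogram(Z, ax=ax)
ax.set_xlabel('Point index')
ax.set_ylabel('Merge distance')
plt.show()
```

The height at which two branches join is the linkage distance of that merge; cutting the tree at a chosen height yields a flat clustering.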

Agglomerative clustering (sklearn implementation):¶

Scikit-learn documentation is here

Let's apply this on the zoo dataset:

In [11]:
df = pd.read_csv('zoo_data.csv')
df
Out[11]:
animal_name hair feathers eggs milk airborne aquatic predator toothed backbone breathes venomous fins legs tail domestic catsize type
0 aardvark 1 0 0 1 0 0 1 1 1 1 0 0 4 0 0 1 1
1 antelope 1 0 0 1 0 0 0 1 1 1 0 0 4 1 0 1 1
2 bass 0 0 1 0 0 1 1 1 1 0 0 1 0 1 0 0 4
3 bear 1 0 0 1 0 0 1 1 1 1 0 0 4 0 0 1 1
4 boar 1 0 0 1 0 0 1 1 1 1 0 0 4 1 0 1 1
... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ...
96 wallaby 1 0 0 1 0 0 0 1 1 1 0 0 2 1 0 1 1
97 wasp 1 0 1 0 1 0 0 0 0 1 1 0 6 0 0 0 6
98 wolf 1 0 0 1 0 0 1 1 1 1 0 0 4 1 0 1 1
99 worm 0 0 1 0 0 0 0 0 0 1 0 0 0 0 0 0 7
100 wren 0 1 1 0 1 0 0 0 1 1 0 0 2 1 0 0 2

101 rows × 18 columns

In [12]:
# Extract features and labels
X= df.iloc[:,1:-1]
y_true = df.iloc[:,-1]
In [15]:
# Perform agglomerative clustering 
from sklearn.cluster import AgglomerativeClustering
In [16]:
clustering = AgglomerativeClustering(n_clusters = 7).fit(X)
In [18]:
#print labels
clustering.labels_
Out[18]:
array([1, 1, 2, 1, 1, 1, 1, 2, 2, 1, 1, 3, 2, 6, 0, 0, 3, 1, 2, 2, 3, 3,
       1, 3, 0, 5, 5, 4, 1, 4, 0, 1, 4, 3, 2, 1, 1, 3, 2, 0, 0, 3, 0, 3,
       1, 1, 0, 1, 1, 1, 1, 0, 5, 0, 1, 1, 3, 3, 3, 3, 2, 2, 6, 5, 1, 1,
       2, 1, 1, 1, 1, 3, 0, 2, 2, 4, 6, 6, 3, 3, 6, 6, 2, 3, 4, 0, 2, 3,
       0, 5, 5, 5, 2, 4, 1, 3, 4, 0, 1, 6, 3])
In [26]:
# Compare true labels with the ones from clustering 
y_pred = clustering.labels_
result = pd.DataFrame(np.transpose(np.vstack((y_true, y_pred))), columns= ['y_true', 'y_label'])
result.iloc[:20,:]
Out[26]:
y_true y_label
0 1 1
1 1 1
2 4 2
3 1 1
4 1 1
5 1 1
6 1 1
7 4 2
8 4 2
9 1 1
10 1 1
11 2 3
12 4 2
13 7 6
14 7 0
15 7 0
16 2 3
17 1 1
18 4 2
19 1 2
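Note that the cluster labels are arbitrary integer codes (e.g., predicted label 2 above corresponds to true type 4), so comparing the two columns directly is misleading. One common way to score agreement up to relabeling, sketched below on short made-up stand-ins for `y_true` and `y_pred`, is the adjusted Rand index from scikit-learn:

```python
from sklearn.metrics import adjusted_rand_score

# Illustrative stand-ins: the predicted labels are a pure relabeling of the true ones
y_true_demo = [1, 1, 4, 1, 4, 4, 2, 2]
y_pred_demo = [1, 1, 2, 1, 2, 2, 3, 3]

# ARI is invariant to label permutation: 1.0 = identical partitions, ~0 = random
score = adjusted_rand_score(y_true_demo, y_pred_demo)
print(score)  # 1.0, since the two partitions match exactly up to renaming
```

Applying the same function to `y_true` and `clustering.labels_` would quantify how well the agglomerative clustering recovers the animal types.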

Clustering: Density-based Algorithms
¶

One example of density-based methods is DBSCAN (Density-Based Spatial Clustering of Applications with Noise).

Here is a visual demonstration:

In [76]:
display(Image(filename = 'DB_scan_visual.gif', width= 1000, height= 1000))

Scikit-learn documentation can be accessed here

In [27]:
from sklearn.datasets import make_moons
from sklearn.cluster import DBSCAN
from sklearn.neighbors import NearestNeighbors
from scipy.spatial.distance import pdist, squareform
In [41]:
# Generate sample data
X, y = make_moons(n_samples=300, noise=0.1, random_state=0)
In [44]:
# Function to plot data
def plot_data(X, title, ax):
    ax.scatter(X[:, 0], X[:, 1], s=10)
    ax.set_title(title)
    ax.set_xlabel('Feature 1')
    ax.set_ylabel('Feature 2')
In [78]:
# Plot initial data
fig, axs = plt.subplots(2, 2, figsize=(12, 10))
plot_data(X, 'Initial Data', axs[0, 0])

# Compute the distance matrix
dist_matrix = squareform(pdist(X))

# Plot distance matrix
axs[0, 1].imshow(dist_matrix, cmap='hot', interpolation='nearest')
axs[0, 1].set_title('Distance Matrix')
axs[0, 1].set_xlabel('Point Index')
axs[0, 1].set_ylabel('Point Index')

# Apply DBSCAN
eps = 0.2
min_samples = 10
dbscan = DBSCAN(eps=eps, min_samples=min_samples)
labels = dbscan.fit_predict(X)

# Plot clustering result
axs[1, 0].scatter(X[:, 0], X[:, 1], c=labels, cmap='viridis', s=10)
axs[1, 0].set_title('DBSCAN Clustering Result')
axs[1, 0].set_xlabel('Feature 1')
axs[1, 0].set_ylabel('Feature 2')

# Highlight core samples
core_samples_mask = np.zeros_like(labels, dtype=bool)
core_samples_mask[dbscan.core_sample_indices_] = True

axs[1, 1].scatter(X[:, 0], X[:, 1], c=labels, cmap='viridis', s=10)
axs[1, 1].scatter(X[core_samples_mask, 0], X[core_samples_mask, 1], c='red', marker='o', s=50, edgecolor='k', label='Core Samples')
axs[1, 1].set_title('Core Samples Highlighted')
axs[1, 1].set_xlabel('Feature 1')
axs[1, 1].set_ylabel('Feature 2')
axs[1, 1].legend()

plt.tight_layout()
plt.show()
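The value `eps = 0.2` above was chosen by hand. A common heuristic for choosing it (sketched below; not part of the original analysis) is the k-distance plot: sort every point's distance to its `min_samples`-th nearest neighbor and look for the knee of the curve, which suggests a density threshold:

```python
import numpy as np
import matplotlib.pyplot as plt
from sklearn.datasets import make_moons
from sklearn.neighbors import NearestNeighbors

X, _ = make_moons(n_samples=300, noise=0.1, random_state=0)

min_samples = 10
nn = NearestNeighbors(n_neighbors=min_samples).fit(X)
# Note: on the training data, each point is its own nearest neighbor (distance 0)
dists, _ = nn.kneighbors(X)

# Distance to the min_samples-th neighbor of each point, sorted ascending
k_dist = np.sort(dists[:, -1])

plt.plot(k_dist)
plt.xlabel('Points sorted by k-distance')
plt.ylabel(f'Distance to {min_samples}-th nearest neighbor')
plt.show()
```

Reading off the distance at the knee gives a candidate `eps`; points to the right of the knee are in sparse regions and tend to become noise points (label −1).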

Applications of clustering in Earth and Environmental Sciences
¶

1) Using clustering to determine optimum sampling locations¶

In [52]:
display(Image(filename = 'Van Arkel & Kaleita_WRR2014.png', width= 750, height= 750))

Research questions:¶

  • Can K-means clustering be used to effectively identify critical sampling locations in an agricultural field?
  • Can the clustering be successful based on input data that does not include soil moisture?
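One plausible way to turn K-means output into sampling locations (a sketch under assumed inputs, not necessarily the paper's exact procedure) is to pick, within each cluster, the grid point closest to the cluster centroid as that cluster's representative sampling site:

```python
import numpy as np
from sklearn.cluster import KMeans

# Made-up stand-in features for field grid points (e.g., elevation, slope)
rng = np.random.default_rng(0)
features = rng.normal(size=(100, 2))

k = 5
km = KMeans(n_clusters=k, n_init=10, random_state=0).fit(features)

# For each cluster, the index of the grid point nearest its centroid
sample_idx = []
for c in range(k):
    members = np.where(km.labels_ == c)[0]
    d = np.linalg.norm(features[members] - km.cluster_centers_[c], axis=1)
    sample_idx.append(members[np.argmin(d)])

print(sample_idx)  # candidate sampling locations, one per cluster
```

Measurements at these few representative points can then stand in for each cluster's average behavior across the whole field.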

Research site and data used in the study:¶

In [54]:
display(Image(filename = 'Fig1_Van Arkel & Kaleita_WRR2014.jpg', width= 600, height= 600))

Figure 1: Brooks Field sampling grid with elevation (shading, in m) and soil types. Points are on 50 m intervals. Soil type indices, according to the National Cooperative Soil Survey: 55: Nicollet loam, 1−3% slopes; 95: Harps loam, 1−3% slopes; 107: Webster clay loam, 0−2% slopes; 138B: Clarion loam, 2−5% slopes; 138C2: Clarion loam, 5−9% slopes, moderately eroded; 507: Canisteo clay loam, 0−2% slopes

This is an agricultural site in Iowa, approximately 300 × 250 m in size. Soil moisture measurements were taken during the growing seasons from 2004 to 2008.

K-means clustering was applied with the following input data:¶

  • K-means $M_{\theta}$: uses soil moisture data ($\theta$) for the year 2004.
  • K-means $M_{T}$: uses topographic data.
  • K-means $M_{E}$: uses electromagnetic induction data.
In [61]:
display(Image(filename = 'Table1_Van Arkel & Kaleita_WRR2014.png', width= 1200, height= 1000))

Evaluation:¶

In [62]:
display(Image(filename = 'Table2_Van Arkel & Kaleita_WRR2014.png', width= 1200, height= 1000))

Read more about this study here¶

2) Using clustering to determine groundwater trends during and after the Australian Millennium Drought¶

In [63]:
display(Image(filename = 'Chen et al_ERL2024.png', width= 750, height= 750))

Key research questions:¶

1- What are the spatial patterns of groundwater levels during and after the drought?

Rainfall Anomalies during the drought (1997 - 2009)¶

In [67]:
display(Image(filename = 'Fig3_ERL2024.jpg', width= 750, height= 750))

Trends in Groundwater levels (clustered using K-means)¶

In [72]:
display(Image(filename = 'Fig4a_ERL2024.jpg', width= 1200, height= 900))

How did they determine the number of clusters (K = 4)?¶

They used the elbow method, which computes the within-cluster error (aka distortion) as a function of K and then identifies the "elbow" point beyond which adding more clusters yields only small reductions in error.
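The elbow heuristic can be sketched with scikit-learn's `KMeans`, whose `inertia_` attribute is the within-cluster sum of squared distances; the blobs below are synthetic stand-ins for the groundwater time series:

```python
import matplotlib.pyplot as plt
from sklearn.cluster import KMeans
from sklearn.datasets import make_blobs

# Synthetic data with 4 true clusters (illustrative only)
X, _ = make_blobs(n_samples=300, centers=4, random_state=0)

ks = range(1, 10)
inertias = []
for k in ks:
    km = KMeans(n_clusters=k, n_init=10, random_state=0).fit(X)
    inertias.append(km.inertia_)  # within-cluster sum of squared distances

plt.plot(list(ks), inertias, marker='o')
plt.xlabel('Number of clusters K')
plt.ylabel('Distortion (inertia)')
plt.show()
```

The distortion always decreases as K grows; the elbow is where the curve bends from a steep drop to a gentle slope (near K = 4 for data like these).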

In [75]:
display(Image(filename = 'Fig4b_ERL2024.jpg', width= 800, height= 800))

You can read more about this study here¶

In [ ]: